| built_age | count |
|:------------|--------:|
| Pre-30s | 36523 |
| Post-30s | 30464 |
Building a classifier to predict house age
Your task
The objective of the investigation is to determine whether energy performance data can be used to estimate the age category of residential houses (pre-1930 or post-1930). The work entails applying supervised machine learning approaches to actual EPC data, assessing model performance, and interpreting the findings in a relevant manner.
The work demonstrates skills in data exploration, classification modelling, model assessment, and the clear communication of statistical findings.
The Problem
Energy Performance Certificate (EPC) data give extensive information on how energy is used in homes across the United Kingdom. For this study, EPC data from March 2025 were used. Although EPCs may not represent a completely random sample of all dwellings, they include a wide variety of data relating to energy efficiency, carbon emissions, and estimated running costs. This makes them well suited to practical data science and modelling work.
A significant problem in housing decarbonisation is that buildings constructed before 1930 often require different retrofit options from more modern dwellings. Older homes often have solid walls, poor insulation, and inefficient heating systems, whereas newer homes tend to have cavity walls, better glazing, and greater thermal efficiency. Being able to identify these age groups automatically would be highly beneficial for local governments with limited finances.
The purpose of this study is to determine whether a property's age category (pre-1930 or post-1930) can be reliably predicted using only a few energy-related indicators. The predictors used are current energy efficiency, current environmental impact, energy consumption, carbon emissions, and estimated lighting, heating, and hot water costs. All observations are from a single local authority, so the model reflects variation within one area rather than between areas.
The Data
A very simple preview of the data is given below:
And a clumsy presentation of some group summary statistics:
The dataset consists of 36,523 properties from before 1930 and 30,464 properties from after 1930. While the groups are not perfectly balanced, each has enough data to fit effective classification models.
| Statistic | Post-30s | Pre-30s |
|:---------------------------------|-----------:|----------:|
| current_energy_efficiency_count | 30464 | 36523 |
| current_energy_efficiency_mean | 63.564 | 51.656 |
| current_energy_efficiency_std | 11.028 | 13.456 |
| current_energy_efficiency_min | 1 | 1 |
| current_energy_efficiency_25% | 58 | 46 |
| current_energy_efficiency_50% | 66 | 54 |
| current_energy_efficiency_75% | 71 | 60 |
| current_energy_efficiency_max | 103 | 96 |
| environment_impact_current_count | 30464 | 36523 |
| environment_impact_current_mean | 60.501 | 47.304 |
| environment_impact_current_std | 11.904 | 12.503 |
| environment_impact_current_min | 1 | 1 |
| environment_impact_current_25% | 53 | 40 |
| environment_impact_current_50% | 62 | 48 |
| environment_impact_current_75% | 69 | 55 |
| environment_impact_current_max | 105 | 109 |
| energy_consumption_current_count | 30464 | 36523 |
| energy_consumption_current_mean | 272.404 | 374.525 |
| energy_consumption_current_std | 103.206 | 133.378 |
| energy_consumption_current_min | 0 | -64 |
| energy_consumption_current_25% | 205 | 290 |
| energy_consumption_current_50% | 254 | 353 |
| energy_consumption_current_75% | 319 | 423 |
| energy_consumption_current_max | 2333 | 2173 |
| co2_emissions_current_count | 30464 | 36523 |
| co2_emissions_current_mean | 4.342 | 6.782 |
| co2_emissions_current_std | 2.086 | 4.388 |
| co2_emissions_current_min | -0.5 | -1.1 |
| co2_emissions_current_25% | 3 | 4.2 |
| co2_emissions_current_50% | 3.8 | 5.8 |
| co2_emissions_current_75% | 5.1 | 8.1 |
| co2_emissions_current_max | 50 | 211 |
| lighting_cost_current_count | 30464 | 36523 |
| lighting_cost_current_mean | 90.082 | 91.844 |
| lighting_cost_current_std | 40.153 | 43.339 |
| lighting_cost_current_min | 15 | 10 |
| lighting_cost_current_25% | 64 | 63 |
| lighting_cost_current_50% | 82 | 83 |
| lighting_cost_current_75% | 107 | 111 |
| lighting_cost_current_max | 1973 | 994 |
| heating_cost_current_count | 30464 | 36523 |
| heating_cost_current_mean | 830.955 | 1303.34 |
| heating_cost_current_std | 493.637 | 952.761 |
| heating_cost_current_min | 66 | 176 |
| heating_cost_current_25% | 529 | 760 |
| heating_cost_current_50% | 702 | 1070 |
| heating_cost_current_75% | 976 | 1553 |
| heating_cost_current_max | 12972 | 46573 |
| hot_water_cost_current_count | 30464 | 36523 |
| hot_water_cost_current_mean | 142.224 | 143.522 |
| hot_water_cost_current_std | 90.175 | 111.191 |
| hot_water_cost_current_min | 34 | 0 |
| hot_water_cost_current_25% | 94 | 93 |
| hot_water_cost_current_50% | 110 | 107 |
| hot_water_cost_current_75% | 156 | 134 |
| hot_water_cost_current_max | 1879 | 2260 |
The grouped summary statistics show a clear and consistent difference between the two age groups. Pre-1930 properties have a much lower mean energy efficiency rating (51.7 against 63.6) and higher mean energy consumption (374.5 against 272.4) and carbon emissions (6.8 against 4.3) than post-1930 properties. Heating costs differ most sharply: older houses have a far higher mean (about 1,303 against 831) and nearly twice the standard deviation. Environmental impact scores follow a similar pattern, with pre-1930 houses scoring markedly worse. Lighting and hot water costs, by contrast, are almost identical in the two groups.
These consistent differences across several energy-related variables suggest that the predictors carry useful information, which should allow supervised classification algorithms to discriminate between pre- and post-1930 properties with reasonable accuracy.
It is always useful to view a scatter plot of the data, marking the two known groups:
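The class counts and grouped summaries discussed above could be produced along the following lines. This is a minimal sketch rather than the code actually used: the data frame `epc` is a synthetic stand-in with random values, and only the column names `built_age`, `current_energy_efficiency` and `heating_cost_current` are taken from the tables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the EPC extract (assumption): random values only,
# with column names borrowed from the summary table.
rng = np.random.default_rng(0)
n = 200
epc = pd.DataFrame({
    "built_age": rng.choice(["Pre-30s", "Post-30s"], size=n),
    "current_energy_efficiency": rng.integers(1, 100, size=n),
    "heating_cost_current": rng.gamma(shape=2.0, scale=500.0, size=n),
})

# Number of properties in each age group.
counts = epc["built_age"].value_counts()

# Descriptive statistics per group. Transposing gives one row per
# (variable, statistic) pair and one column per age group.
summary = epc.groupby("built_age").describe().T

# Flatten the two-level row index into labels such as
# "heating_cost_current_mean".
summary.index = [f"{var}_{stat}" for var, stat in summary.index]

print(counts)
print(summary.round(3))
```

Flattening the row index is what yields labels of the form `variable_statistic`, which is the layout used in the summary table.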
<iframe src="../../data_cache/vignettes/supervised_classification/scatterplot.html" width="100%" height="600px"></iframe>
The scatter plot indicates that pre-1930 homes cluster in regions of higher energy consumption and heating cost, while post-1930 properties are concentrated at lower values of these variables. Although there is considerable overlap between the groups, the general separation suggests that classification models should be capable of reasonable predictive performance.
Fitting LDA and QDA classifiers
Two standard supervised classification techniques were used: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). LDA produces a linear decision boundary under the assumption that both groups share the same covariance matrix. QDA relaxes this assumption by allowing each group its own covariance matrix, which results in a quadratic decision boundary.
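A minimal sketch of fitting the two classifiers with scikit-learn is shown below. The data are synthetic (two Gaussian clouds loosely based on the group means in the summary table), and the stratified 30% hold-out is an assumption, chosen because it is consistent with the support counts in the reports that follow (20,097 of 66,987 properties). It is not the code used to produce those reports.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption): two predictors standing in for
# energy efficiency and energy consumption.
rng = np.random.default_rng(42)
n_per_class = 1000
X_post = rng.normal(loc=[65.0, 270.0], scale=[11.0, 100.0], size=(n_per_class, 2))
X_pre = rng.normal(loc=[52.0, 375.0], scale=[13.0, 130.0], size=(n_per_class, 2))
X = np.vstack([X_post, X_pre])
y = np.array(["Post-30s"] * n_per_class + ["Pre-30s"] * n_per_class)

# Hold out 30% of the data, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "LDA": LinearDiscriminantAnalysis(),      # one pooled covariance matrix
    "QDA": QuadraticDiscriminantAnalysis(),   # one covariance matrix per class
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Precision, recall, F1 and support per class, plus overall accuracy.
    reports[name] = classification_report(y_test, y_pred, output_dict=True)
    print(name, "accuracy:", round(reports[name]["accuracy"], 3))
```

Both models are scored on the same held-out rows, so their reports can be compared directly.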
Results from a linear discriminant analysis
| Category | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| Post-30s | 0.734 | 0.689 | 0.711 | 9140 |
| Pre-30s | 0.753 | 0.792 | 0.772 | 10957 |
| accuracy | | | 0.745 | 20097 |
| macro avg | 0.744 | 0.74 | 0.741 | 20097 |
| weighted avg | 0.745 | 0.745 | 0.744 | 20097 |
The LDA model achieves an overall classification accuracy of about 74.5%. Precision is similar for the two age groups (0.734 and 0.753), while recall is somewhat higher for pre-1930 properties (0.792) than for post-1930 properties (0.689). The F1-scores for both classes exceed 0.70, indicating that the model performs reasonably consistently across pre-1930 and post-1930 properties. Overall, LDA is an effective baseline classifier for this problem.
Results from fitting a quadratic discriminant analysis
| Category | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| Post-30s | 0.561 | 0.883 | 0.686 | 9140 |
| Pre-30s | 0.813 | 0.424 | 0.557 | 10957 |
| accuracy | | | 0.633 | 20097 |
| macro avg | 0.687 | 0.654 | 0.622 | 20097 |
| weighted avg | 0.699 | 0.633 | 0.616 | 20097 |
The QDA model has a lower overall accuracy of about 63.3%. It identifies most post-1930 homes (recall 0.883) but performs poorly on pre-1930 properties, where recall falls to 0.424. This means that many older homes are mistakenly labelled as newer, which also explains the low precision (0.561) for the post-1930 class. Although QDA allows more flexible decision boundaries, it appears to overfit some regions of the data and performs worse than LDA.
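The figures in the two reports can be cross-checked with a little arithmetic: the F1-score is the harmonic mean of precision and recall, and overall accuracy equals the support-weighted average of the per-class recalls. The sketch below uses only the rounded values from the tables, so agreement is expected to about three decimal places; the helper names `f1` and `accuracy_from_recalls` are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy_from_recalls(recalls, supports):
    """Overall accuracy as the support-weighted mean of per-class recall."""
    correct = sum(r * s for r, s in zip(recalls, supports))
    return correct / sum(supports)


# Test-set sizes for Post-30s and Pre-30s, taken from the reports.
supports = [9140, 10957]

# F1 from the reported precision and recall.
lda_f1_post = f1(0.734, 0.689)
qda_f1_pre = f1(0.813, 0.424)

# Accuracy from the reported recalls.
lda_acc = accuracy_from_recalls([0.689, 0.792], supports)
qda_acc = accuracy_from_recalls([0.883, 0.424], supports)

print(round(lda_f1_post, 3), round(qda_f1_pre, 3))
print(round(lda_acc, 3), round(qda_acc, 3))
```

The recomputed values match the reported F1-scores (0.711 and 0.557) and accuracies (0.745 and 0.633), which confirms that the lower QDA accuracy is driven by its low recall on the larger pre-1930 class.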
ROC curve (and AUC) for two simple classifiers
The ROC curves for both classifiers are presented below:
Receiver Operating Characteristic (ROC) curves were used to assess the classifiers' ability to discriminate between pre- and post-1930 properties across all classification thresholds. The Area Under the Curve (AUC) is a single summary measure of this discriminative ability.
The LDA model has an AUC of 0.81, and the QDA model has a very similar AUC of 0.80. Both values are well above 0.5, showing that the models outperform random guessing.
The slightly higher AUC for the LDA model indicates that it has somewhat greater overall discriminative ability than the QDA model. This is consistent with the earlier classification metrics, in which LDA achieved higher accuracy and more balanced precision and recall across the two age groups. Overall, LDA produces the most consistent results for this classification task.
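ROC curves and AUC values of this kind could be computed as in the sketch below. It again uses synthetic data and an assumed 30% stratified hold-out, so the printed AUCs are illustrative and not the 0.81 and 0.80 reported above. The pre-1930 class is coded as 1 and treated as the positive class, which is a choice made for this example.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): label 1 = pre-1930, label 0 = post-1930.
rng = np.random.default_rng(1)
n_per_class = 1000
X = np.vstack([
    rng.normal(loc=[65.0, 270.0], scale=[11.0, 100.0], size=(n_per_class, 2)),
    rng.normal(loc=[52.0, 375.0], scale=[13.0, 130.0], size=(n_per_class, 2)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

curves, aucs = {}, {}
for name, model in {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}.items():
    model.fit(X_train, y_train)
    # Predicted probability of the positive (pre-1930) class.
    scores = model.predict_proba(X_test)[:, 1]
    # False and true positive rates at every distinct score threshold.
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr)
    aucs[name] = roc_auc_score(y_test, scores)
    print(name, "AUC:", round(aucs[name], 3))
```

Because the ROC curve is built from predicted probabilities rather than hard class labels, it does not depend on the default 0.5 threshold used in the classification reports.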
Record of AI use for MTHM503 supervised coursework
| Date | AI tool used | Purpose | Prompt | Hyperlink to output (where possible) | Section of work used for |
|---|---|---|---|---|---|
| 10/12/2025 | ChatGPT | To improve structure and academic tone | “Can you help me improve the structure and check the academic tone?” | N/A | Multiple Sections |
| 10/12/2025 | ChatGPT | To improve grammar and academic wording | “Check the paragraph for any grammatical errors and see if the words are suitable for an academic tone” | N/A | Multiple Sections |